Identification of novel biomarkers and immune infiltration features of recurrent pregnancy loss by machine learning

Recurrent pregnancy loss (RPL) is a complex reproductive disorder. The incompletely understood pathophysiology of RPL makes early detection and exact treatment difficult. The purpose of this work was to discover optimally characterized genes (OFGs) of RPL and to investigate immune cell infiltration in RPL. It will aid in better understanding the etiology of RPL and in the early detection of RPL. The RPL-related datasets were obtained from the Gene Expression Omnibus (GEO), namely GSE165004 and GSE26787. We performed functional enrichment analysis on the screened differentially expressed genes (DEGs). Three machine learning techniques are used to generate the OFGs. A CIBERSORT analysis was conducted to examine the immune infiltration in RPL patients compared with normal controls and to investigate the correlation between OFGs and immune cells. Between the RPL and control groups, 42 DEGs were discovered. These DEGs were found to be involved in cell signal transduction, cytokine receptor interactions, and immunological response, according to the functional enrichment analysis. By integrating OFGs from the LASSO, SVM-REF, and RF algorithms (AUC > 0.880), we screened for three down-regulated genes: ZNF90, TPT1P8, FGF2, and an up-regulated FAM166B. Immune infiltration study revealed that RPL samples had more monocytes (P < 0.001) and fewer T cells (P = 0.005) than controls, which may contribute to RPL pathogenesis. Additionally, all OFGs linked with various invading immune cells to varying degrees. In conclusion, ZNF90, TPT1P8, FGF2, and FAM166B are potential RPL biomarkers, offering new avenues for research into the molecular mechanisms of RPL immune modulation and early detection.

Recurrent pregnancy loss (RPL) is a distressing pregnancy disorder defined as the presence of two or more clinically recognized pregnancy losses before 20-24 weeks of gestation 1 . RPL affects about 2.5% of women who are trying to get pregnant. Recurrent miscarriages can be caused by a variety of conditions, including genetic, anatomical, endocrine, and immune-related illnesses 2 . However, the cause of about 50% of RPL instances is still unknown, and the etiology of RPL has not yet been thoroughly clarified 3 . As a result, progress in the development of accurate diagnosis and early prediction of recurrent miscarriage is stymied.
A woman's endometrial immune system is essential to the success of her pregnancy because it functions as a semi-allograft of the maternal host. Early in pregnancy, roughly 40% of the decidua's cells are endometriumresident immune cells, which serve regulatory roles during embryo implantation to guarantee maternal tolerance of the embryo 4 . Decidual lymphocytes play a crucial role in the early stages of pregnancy, among other things by removing apoptotic cells, spotting infections, encouraging trophoblast invasion, and controlling decidualization. Activated natural killer (NK) cells release growth-promoting factors that promote fetal maturation, and CD49a + Emos + NK cells recognize HLA-G expressed on extravillous trophoblasts 5,6 . Depletion of CD4+CD25+Treg cells results in pregnancy loss in mice because regulatory T (T reg) cells promote tolerance between fetal and maternal cells 4,7 . Furthermore, the quantity of tolerogenic dendritic cells (DCs) in the endometrium was dramatically decreased in the mid-luteal phase of RPL women, which induced the differentiation of Treg cells in the endometrium and other tissues 8,9 . At the same time, depletion of DCs in the endometrium also interferes with embryo implantation and leads to early embryo resorption, which is related to impaired decidualization and reduced vasodilation 10  www.nature.com/scientificreports/ chemokines. They differentiate into macrophages or DCs at the onset of pregnancy, participating in regulating maternal-fetal immunity 11 . Collectively, RPL is hypothesized to have a common etiology of compromised endometrial immunity.
Recently, major functional genes in several diseases have been discovered using microarray technology and thorough bioinformatics analysis, which can then be employed as diagnostic and predictive biomarkers [12][13][14] . Finding illness biomarkers is frequently done using machine learning (ML) techniques. We can account for the magnitude and direction of interactions between predictors and outcomes using Support Vector Machines-Recursive Feature Elimination (SVM-RFE) in machine learning 15 . A gene expression-based deconvolution technique called CIBERSORT is employed to evaluate immune cell infiltration 16 . To the best of our knowledge, however, the combination study of SVM-RFE, LASSO, Random Forest (RF), and CIBERSORT has not been used to identify putative biomarkers of RPL and forecast immune cell infiltration in RPL patients.
The purpose of this study was to screen for novel biomarkers in the endometrium associated with RPL using ML techniques. In addition, we used the CIBERSORT algorithm to assess immune cell infiltration in RPL and analyzed the relationship between biomarker expression and immune cell infiltration.

Materials and methods
Preprocessing and collection of data. In Fig. 1, we can see the workflow of the research. GSE165004 and GSE26787 were downloaded from the Gene Expression Omnibus (GEO) database in NCBI 17 . GSE165004 was based on the GPL16699 platform, which contained endometrial tissues of 24 RPL women and 24 controls. And GSE26787 was based on GPL570, consisting of 5 RPL patients and 5 control endometrial samples. With R Figure 1. The workflow of the study. www.nature.com/scientificreports/ packages "limma" and "sva", two datasets were then merged and batch-normalized 18 . With the help of R software, 40 RPL and 18 healthy women were randomly divided into the training and testing cohorts in a ratio of two to one for the following analysis.
Calculation of differentially expressed genes. Using the "limma" R package, we gained differences in gene expression between RPL and control tissues, and DEGs were set to |log2FC|> 1.0 along with P-value < 0.05. For visualizations of DEGs, heatmaps and volcano graphs were produced by "ggplot2" and "pheatmap" packages in R.
Enrichment assessment of DEGs. The Gene Ontology (GO), Disease Ontology (DO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways 19 were assessed for functional enrichment using the "cluster-Profiler" package of R to discover the underlying biological functions of DEGs.
Selection of optimal feature genes (OFGs). To identify OFGs, we selected three different machine learning (ML) algorithms. "Glmnet", an R package, constructed the Least absolute shrinkage and selection operator (LASSO) binary logistic regression model. The optimal penalty parameter was determined and used in every signature using a tenfold cross-validation minimum 20 . The SVM-RFE, a nonlinear support vector machine implemented in the R package "e1071", "kernlab", and "caret", is applied to determine the OFGs 15 . To filter OFGs, we used the R package "randomforest" to generate 500 trees for every datapoint and retained the top 5 key genes 21 . Furthermore, the venn graph exhibited the OFGs of the intersection of three machine learning.

Diagnostic quality verification of OFGs.
A testing set was applied to verify screened OFGs as a validation step. To visualize the expression of crucial OFGs in the RPL and control women of testing set, we constructed boxplots using the R packages "ggplot2″ and "ggpubr". Area under curve (AUC) was applied to assess the predictive value of OFGs using the receiver operating characteristic (ROC) curve computed by the "pROC" package in R 22 . Considering this, OFGs were recognized as prospective biomarkers with highly predictive and diagnostic capabilities once the AUC exceeded 0.85.

Assessment of immune cell Infiltration.
Utilizing the CIBERSORT algorithm (https:// ciber sort. stanf ord. edu/), we computed the infiltrating abundances and differences of 22 immune cells. The outcomes were visualized with heatmap, and violin graph produced with the "corrplot" and "ggplot2" packages in R 16 . The relationship between OFGs and immune cells was estimated by spearman correlation coefficient.
Statistical analysis. Every statistical calculation and graph were executed by R (version 4.2.2). It was confirmed once p-value less than 0.05 as statistically significant.

Identification of DEGs.
Here is a diagram of the study's workflow (Fig. 1). DEGs were generated based on the training dataset, which contained endometrium tissues from 20 RPL patients and 20 controls. A comparison between RPL and control women revealed 42 DEGs with 28 upregulated and 14 downregulated genes. Heatmaps and volcano graphs were produced correspondingly to display the consequences (Fig. 2).
Enrichment assessment of DEGs. To complete the enrichment analysis of DEGs by GO/DO/KEGG, we applied the "clusterProfiler" package in R. For GO analysis, In RPL patients, the biological process, cellular component, and molecular function associated with anion transport, collagen-containing extracellular matrix and receptor-ligand activity were identified as the most enriched functions (Fig. 3A). KEGG analysis primarily targeted on Ras signalling pathway, and PI3K-Akt signalling pathway (Fig. 3B). Furthermore, DO analysis revealed tight associations of DEGs with myocardial infarction, bone remodelling disease and peptic ulcer disease in RPL women (Fig. 3C).
Screening and validating OFGs. In the RPL-related DEGs, the LASSO and RF algorithms each screened five valuable genes. Additionally, the SVM-REF algorithm was applied to filter six crucial genes. The four intersecting OFGs are known as zinc finger protein 90 (ZNF90), putative translationally controlled tumor proteinlike protein TPT1P8 (TPT1P8), fibroblast growth factor 2 (FGF2) and family with sequence similarity 166, member B (FAM166B) (Fig. 4). In the RPL, ZNF90, TPT1P8, and FGF2 are down-regulated and FAM166B is up-regulated. As verified by the testing set, the expression of OFGs were noticeably decreased in RPL women, except for FAM166B ( Fig. 5A-D). To assess their diagnostic effectiveness, we produced ROC curves for the OFGs in testing cohort. All OFGs displayed excellent diagnostic results with AUCs exceeding 0.88 (Fig. 5E-H). Consequently, ZNF90, TPT1P8, FGF2 and FAM166B were identified to be promising biomarkers for diagnosing RPL.

Assessment of immune cell infiltration.
In further analysis, we applied the CIBERSORT algorithm to discover the pertinent proportions of 22 different immune cell types. As shown in the bar chart, each sample has a different proportion of immune cell subpopulations (Fig. 6A). The violin graph demonstrated that infiltration of monocytes was strongly significant in the RPL group, whereas infiltration of γδ T cells exhibited remarkably significant in control dataset (Fig. 6B). Moreover, the interaction among immune cells revealed that regulatory www.nature.com/scientificreports/ www.nature.com/scientificreports/ T cells exhibited the most positive correlation with M0 macrophages, although CD8 T cells negatively exhibited relationship with CD4 memory resting T cells (Fig. 6C).

Relationships between immune cells and biomarkers. Correlation analyses were executed to estimate
the connections between immune cells and biomarkers. We discovered that the down-regulated genes: ZNF90 and TPT1P8 were positively related to γδ T cells (ZNF90: R = 0.324, P = 0.041; TPT1P8: R = 0.328, P = 0.034), but negatively associated with monocytes (ZNF90: R = -0.649, P < 0.001; TPT1P8: R = − 0.418, P = 0.007). M2 macrophages and plasma cells were positively linked to down-regulated FGF2 and negatively connected to upregulated FAM166B, whereas CD4 resting memory T cells were inversely related to these two genes. In addition, ZNF90 was also concerned with eosinophils and naive B cells; FAM166B was associated with monocytes and resting dendritic cells (Fig. 7).

Discussion
RPL is still a major health concern in reproductive medicine, creating a significant psychological burden to individuals because 50% of RPL is idiopathic and evidence-based therapy is restricted. Currently, machine learning algorithms are excellent tools for analyzing underlying linkages and selecting ideal parameters for gene selection among all DEGs of biological significance in high-dimensional data. The discovery of new genes as potential biomarkers, as well as the study of immune cell infiltration features, will have a substantial impact on www.nature.com/scientificreports/ the early diagnosis and prediction of RPL. In this study, we identified a total of 42 DEGs, of which 28 genes were upregulated, and 14 were downregulated, based on the gene expression datasets of RPL and normal controls.
Multifunctional enrichment analysis showed that these DEGs were related to MAPK signaling pathway, PI3K-Akt signaling pathway, inflammation and immune responses. Then based on three machine learning algorithms (LASSO regression model, RF algorithm and SVM-RFE algorithm), we screened out 4 best eigengenes (FGF2, FAM166B, ZNF90 and TPT1P8). Finally, we revealed the relationship between the 4 OFGs and immune cells using the CIBERSORT algorithm. Fibroblast growth factor (FGF) regulates cell fate, angiogenesis, immunity, and metabolism through signalling through its receptors FGFR1, FGFR2, FGFR3 or FGFR4. According to research, dysregulation of FGF signaling causes human diseases such as lung, breast, and stomach cancer, as well as achondroplasia 23 . Furthermore, FGF and its receptors are major factors in fetal and placental angiogenesis. The FGF signaling process regulates www.nature.com/scientificreports/ immunity dynamically as well as regulating inflammation and tissue repair by immune cells. After FGFR1/2 signaling and VEGF/ANGPT2 secretion, FGF2 can promote endothelial cell proliferation and migration, for example 24,25 . FGF1 and FGF2 promote neutrophil chemotaxis to damaged tissues through FGFR2 signaling 26 .
In addition, Cox reported that implantation of exogenous FGF-2 into the quail embryonic environment induced angiogenic cells and patterned blood vessel formation 27 . Thus, low FGF2 expression may contribute to RPL by hindering embryonic angiogenesis. FAM166B is a gene that has yet to be studied in depth. Previous studies involving FAM166B, which focused more on expression in multiple symmetric lipidosis and skeletal muscle, showed that FAM166B is highly expressed in adrenal glands and ciliated cells, but its precise function remains unclear 28 . A recent study showed that FAM166B expression correlates with breast cancer prognosis. The study found that the expression level of FAM166B in breast cancer was negatively correlated with the level of macrophage infiltration and positively correlated with the expression of CD 4+ T cells, which suggesting that the recruitment and regulation of immune infiltrating cells may be mediated by FAM166B in breast cancer 29 . Zinc finger proteins (ZFPs) are the largest family of transcription factors characterized by finger-like DNAbinding domains that play an important role in metabolic processes, autophagy, apoptosis, immune responses, differentiation, and stem cell maintenance 30 . However, only a few studies have reported the involvement of ZFPs in immune-related processes, such as immune response, immune homeostasis, and cytokine production recently 31,32 . ZFPs bind to Zinc, which is involved in the developmental process of oocytes. Abu-Soud reported that zinc deficiency leads to high ROS production in oocytes, which affects oocyte quality and female fertility by interfering with physiological antioxidant mechanisms that act on biomolecular, protein and cellular processes 33 .
In the present study, low expression of ZFPs caused a decrease in binding efficiency to zinc, which may lead to RPL. Currently, there are few studies on TPT1P8, also known as FKSG2. The latest literature found that the significantly lower expression of TPT1P8 in the anterior cingulate cortex (ACC) of Cushing's disease (CD) patients was associated with immune function 34 . Accordingly, the OFGs screened in this study are involved in signaling transduction, inflammation and immune responses, which may contribute to RPL occurrence and progression.
Based on the CIBERSORT analysis, we found that RPL and the control group had significantly different levels of immune cell infiltration, especially monocytes and γδ T cells. Our study found that RPL samples had higher levels of monocyte infiltration. Consistent with our conclusions, previous studies also showed that women with RPL had higher monocyte concentrations detected in peripheral blood than normal fertile controls 35 . During normal pregnancy, immune cells at the fetal-maternal interface increase, such as uterine NK cells and macrophages. Monocytes are short-lived cells that arise from monocyte precursors in the bone marrow and makeup approximately 5-10% of the total number of circulating white blood cells 36 . Accumulating evidence suggests that circulating monocytes are recruited to the decidua at the onset of pregnancy to generate macrophages with essential immune functions. Thus, decidual macrophages contribute to maternal tolerance to fetal antigens 11 . Monocytes also present antigens to T cells, which regulate the adaptive immune response. In addition, they are involved in fundamental processes of a successful pregnancy, such as trophoblast invasion and tissue and vascular remodeling 37 . In addition, our results also showed that γδ T cells were decreased in RPL samples compared with normal controls. According to the T cell receptor (TCR), T cells are divided into αβ T and γδ T cells, which express αβ TCR and γδ TCR, respectively 38 . γδ T cells play numerous roles in establishing and maintaining immune tolerance in early pregnancy but are often overlooked. γδT cells are increased in the early decidua of normal pregnancy. They secrete anti-inflammatory cytokines such as IL-10 and TGF-β and transduce negative signals by expressing regulatory molecules such as PD-1, Tim-3, and CD 160 39,40 . From this, we speculate that dysregulation of endometrial monocytes and γδ T cells in women with RPL biases the maternal immune system towards pro-inflammatory properties, which may ultimately lead to RPL.
In our study, according to the correlation analysis, the four characteristic genes screened are related to immune cell infiltration of RPL. In Fig. 7, the expression of four candidate genes showed strong correlation with monocytes, but weak correlation with other differentially infiltrated immune cells. Comins-Boo suggests that monocyte dysregulation is a major factor contributing to RPL. We therefore hypothesize that key genes interacting with monocytes may promote the development of RPL. However, the specific mechanism for the weak correlation of γδ T cells and plasma cells with key genes is unknown and may be related to the small sample size.
The integration of microarray technology, bioinformatics analysis, and ML algorithms has become a hotbed for biomarker screening, diagnosis prediction, and prognosis evaluation of complicated diseases in recent years. Moreover, computational biology methods can provide the basis for further basic experimental design. In this study, the combination of the LASSO model, RF algorithm and SVM-RFE algorithm was applied to identify potential biomarkers of RPL, as it has rarely been done before. This study, however, has limited data, and more external data, clinical samples, and prospective clinical trials are needed in the future to verify the results.

Conclusion
In this study, we found that ZNF90, TPT1P8, FGF2 and FAM166B could serve as candidate biomarkers for RPL, and we explored their correlations with immune cells in the pathogenesis of RPL. In addition, the differential infiltration of monocytes and γδ T cells is related to the onset and progression of RPL. Future studies with larger sample sizes and more predictive clinical measures are necessary for validating these results.